Abstract: Anticipating the future actions of a human is 
a widely studied problem in robotics that requires spatio- 
temporal reasoning. In this work we propose a deep learning 
approach for anticipation in sensory-rich robotics applications. 
We introduce a sensory-fusion architecture which jointly learns 
to anticipate and fuse information from multiple sensory 
streams. Our architecture consists of Recurrent Neural Networks 
(RNNs) that use Long Short-Term Memory (LSTM) 
units to capture long temporal dependencies. We train our 
architecture in a sequence-to-sequence prediction manner, and 
it explicitly learns to predict the future given only a partial 
temporal context. We further introduce a novel loss layer for 
anticipation which prevents over-fitting and encourages early 
anticipation. We use our architecture to anticipate driving 
maneuvers several seconds before they happen on a natural 
driving data set of 1180 miles. The context for maneuver 
anticipation comes from multiple sensors installed on the 
vehicle. Our approach shows significant improvement over 
the state-of-the-art in maneuver anticipation by increasing the 
precision from 77.4% to 90.5% and recall from 71.2% to 87.4%. 

I. Introduction 

Anticipating the future actions of a human is an important 
perception task and has many applications in robotics. It 
has enabled robots to navigate in a social manner and 
perform collaborative tasks with humans while avoiding 
conflicts [24], [40]. In another application, anticipating 
driving maneuvers several seconds in advance 
enables assistive cars to alert drivers before they make 
a dangerous maneuver. Maneuver anticipation complements 
existing Advanced Driver Assistance Systems (ADAS) by 
giving drivers more time to react to road situations and 
can thereby prevent many accidents [30]. 

Activity anticipation is a challenging problem because it 
requires the prediction of future events from a limited temporal 
context. It is different from activity recognition, 
where the complete temporal context is available for prediction. 
Furthermore, in sensory-rich robotics settings, the 
context for anticipation comes from multiple sensors. In 
such scenarios the end performance of the application largely 
depends on how the information from different sensors is 
fused. Previous works on anticipation [24] usually 
deal with a single data modality and do not address anticipation 
for sensory-rich robotics applications. Additionally, they 
learn representations using shallow architectures [24] 
that cannot handle long temporal dependencies [4]. 

In order to address the anticipation problem more generally, 
we propose a Recurrent Neural Network (RNN) based 

Fig. 1: (Left) Training an RNN for anticipation in a sequence-to-sequence 
prediction manner. The network explicitly learns to map 
the partial context $(\mathbf{x}_1, \ldots, \mathbf{x}_t)\ \forall t$ to the future event $\mathbf{y}$. (Right) At 
test time the network’s goal is to anticipate the future event as soon 
as possible, i.e. by observing only a partial temporal context. 

architecture which learns rich representations for anticipation. 
We focus on sensory-rich robotics applications, and our 
architecture learns how to optimally fuse information from 
different sensors. Our approach captures temporal dependencies 
by using Long Short-Term Memory (LSTM) units. We 
train our architecture in a sequence-to-sequence prediction 
manner (Figure 1) such that it explicitly learns to anticipate 
given a partial context, and we introduce a novel loss layer 
which helps anticipation by preventing over-fitting. 

We evaluate our approach on the task of anticipating driving 
maneuvers several seconds before they happen [22]. 
The context (contextual information) for maneuver anticipation 
comes from multiple sensors installed on the vehicle 
such as cameras, GPS, vehicle dynamics, etc. Information 
from each of these sensory streams provides necessary cues 
for predicting future maneuvers. Our overall architecture 
models each sensory stream with an RNN and then non-linearly 
combines the high-level representations from multiple 
RNNs to make a final prediction. 

We report results on 1180 miles of natural driving data 
collected from 10 drivers [19]. The data set is challenging 
because of the variations in routes and traffic conditions, 
and the driving styles of the drivers (Figure 2). On this 
data set, our deep learning approach improves the state-of-the-art 
in maneuver anticipation by increasing the precision 
from 77.4% to 84.5% and recall from 71.2% to 77.1%. We 
further improve these results by extracting richer features 
from cameras such as the 3D head pose of the driver’s face. 
Including these features in our architecture increases the 
precision and recall to 90.5% and 87.4% respectively. Key 
contributions of this paper are: 

• A sensory-fusion RNN-LSTM architecture for anticipation 
in sensory-rich robotics applications. 

• A new vision pipeline with rich features (such as 3D 
head pose) for maneuver anticipation. 

• State-of-the-art performance on maneuver anticipation 
on 1180 miles of driving data [19]. 

II. Related Work 

Our work is related to previous works on anticipating human 
activities, driver behavior understanding, and Recurrent 
Neural Networks (RNNs) for sequence prediction. 

Several works have studied human activity anticipation 
for human-robot collaboration and forecasting. Anticipating 
human activities has been shown to improve human-robot 
collaboration. Similarly, forecasting human 
navigation trajectories has enabled robots to plan sociable 
trajectories around humans [20]. Feature matching 
techniques have been proposed for anticipating human activities 
from videos. Approaches used in these works 
learn shallow architectures that do not properly model 
temporal aspects of human activities. Furthermore, they deal 
with a single data modality and do not tackle the challenges 
of sensory fusion. We propose a deep learning approach for 
anticipation which efficiently handles temporal dependencies 
and learns to fuse multiple sensory streams. 

We demonstrate our approach on anticipating driving 
maneuvers several seconds before they happen. This is a 
sensor-rich application for alerting drivers several seconds 
before they make a dangerous maneuvering decision. Previous 
works have addressed maneuver anticipation 
through sensory fusion from multiple cameras, 
GPS, and vehicle dynamics. In particular, Morris et al. [27] 
and Trivedi et al. used a Relevance Vector Machine 
(RVM) for intent prediction and performed sensory fusion 
by concatenating feature vectors. 

More recently, Jain et al. [19] showed that concatenation of 
sensory streams does not capture the rich context for modeling 
maneuvers. They proposed an Autoregressive Input-Output 
Hidden Markov Model (AIO-HMM) which fuses 
sensory streams through a linear transformation of features, 
and it performs better than feature concatenation. In 
contrast, we learn an expressive architecture to combine 
information from multiple sensors. Our RNN-LSTM based 
sensory-fusion architecture captures long temporal dependencies 
through its memory cell and learns rich representations 
for anticipation through a hierarchy of non-linear 
transformations of input data. Our work is also related 
to works on driver behavior prediction with different sensors, 
and vehicular controllers which act on 
these predictions. 

Two building blocks of our architecture are Recurrent 
Neural Networks (RNNs) and Long Short-Term Memory 
(LSTM) units. Our work draws upon ideas from 


Fig. 2: Variations in the data set. Images from the data set [19] 
for a left lane change. (Left) Views from the road-facing camera. 
(Right) Driving styles of the drivers vary for the same maneuver. 

previous works on RNNs and LSTM from the language [35], 
speech, and vision communities. Our approach to the 
joint training of multiple RNNs is related to the recent work 
on hierarchical RNNs. We consider RNNs in a multi-modal 
setting, which is related to the recent use of RNNs in 
image captioning. Our contribution lies in formulating 
activity anticipation in a deep learning framework using 
RNNs with LSTM units. We focus on sensory-rich robotics 
applications, and our architecture extends previous works 
on sensory fusion from feed-forward networks [34] to 
the fusion of temporal streams. Using our architecture we 
demonstrate state-of-the-art results on maneuver anticipation. 

III. Preliminaries 

We now formally define anticipation and then present 
our Recurrent Neural Network architecture. The goal of 
anticipation is to predict an event several seconds before it 
happens given the contextual information up to the present 
time. The future event can be one of multiple possibilities. 
At training time a set of temporal sequences of observations 
and events $\{(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T)_j, \mathbf{y}_j\}_{j=1}^{N}$ is provided, where $\mathbf{x}_t$ is 
the observation at time $t$, $\mathbf{y}$ is the representation of the event 
(described below) that happens at the end of the sequence at 
$t = T$, and $j$ is the sequence index. At test time, however, the 
algorithm receives an observation $\mathbf{x}_t$ at each time step, and 
its goal is to predict the future event as early as possible, 
i.e. by observing only a partial sequence of observations 
$\{(\mathbf{x}_1, \ldots, \mathbf{x}_t) \mid t < T\}$. This differentiates anticipation from 
activity recognition, where the complete 
observation sequence is available at test time. In this paper, 
$\mathbf{x}_t$ is a real-valued feature vector and $\mathbf{y} = [y^1, \ldots, y^K]$ is a 
vector of size $K$ (the number of events), where $y^k$ denotes 
the probability of the temporal sequence belonging to event 
$k$, such that $\sum_{k=1}^{K} y^k = 1$. At the time of training, 
$\mathbf{y}$ takes the form of a one-hot vector with the entry in $\mathbf{y}$ 
corresponding to the ground truth event as 1 and the rest 0. 

In this work we propose a deep RNN architecture with 
Long Short-Term Memory (LSTM) units for anticipation. 
Below we give an overview of the standard RNN and 
LSTM which form the building blocks of our architecture 
described in Section IV. 














A. Recurrent Neural Networks 

A standard RNN [29] takes in a temporal sequence of 
vectors $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T)$ as input, and outputs a sequence 
of vectors $(\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T)$, also known as high-level representations. 
The representations are generated by non-linear 
transformation of the input sequence from $t = 1$ to $T$, as 
described in the equations below. 

$$\mathbf{h}_t = f(\mathbf{W}\mathbf{x}_t + \mathbf{H}\mathbf{h}_{t-1} + \mathbf{b}) \tag{1}$$

$$\mathbf{y}_t = \mathrm{softmax}(\mathbf{W}_y\mathbf{h}_t + \mathbf{b}_y) \tag{2}$$

where $f$ is a non-linear function applied element-wise, and 
$\mathbf{y}_t$ is the vector of softmax probabilities of the events having seen 
the observations up to $\mathbf{x}_t$. $\mathbf{W}$, $\mathbf{H}$, $\mathbf{b}$, $\mathbf{W}_y$, $\mathbf{b}_y$ are the 
parameters that are learned. Matrices are denoted with bold, 
capital letters, and vectors are denoted with bold, lower-case 
letters. In a standard RNN a common choice for $f$ is tanh 
or sigmoid. RNNs with this choice of $f$ suffer from the 
well-studied problem of vanishing gradients [29], and hence 
are poor at capturing long temporal dependencies which are 
essential for anticipation. A common remedy to vanishing 
gradients is to replace tanh non-linearities by Long Short-Term 
Memory cells. We now give an overview of LSTM 
and then describe our model for anticipation. 
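
As an illustration of Eqs. (1) and (2), the following is a minimal pure-Python sketch of the forward pass of a standard RNN with $f = \tanh$. The helper names (`matvec`, `softmax`, `rnn_forward`), the zero initial state $\mathbf{h}_0$, and the toy parameter values are assumptions made for this example and are not taken from the paper.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(a):
    """Numerically stable softmax over a list of scores."""
    top = max(a)
    exps = [math.exp(a_k - top) for a_k in a]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(xs, W, H, b, W_y, b_y):
    """Apply Eqs. (1)-(2) with f = tanh and return y_1, ..., y_T."""
    h = [0.0] * len(b)  # assumption: h_0 is the zero vector
    ys = []
    for x in xs:
        pre = [wx + hh + b_i
               for wx, hh, b_i in zip(matvec(W, x), matvec(H, h), b)]
        h = [math.tanh(p) for p in pre]                       # Eq. (1)
        scores = [s + c for s, c in zip(matvec(W_y, h), b_y)]
        ys.append(softmax(scores))                            # Eq. (2)
    return ys

# Hypothetical toy sizes: 2-d observations, 3 hidden units, K = 2 events.
W = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]
H = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
b = [0.0, 0.0, 0.0]
W_y = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]
b_y = [0.0, 0.0]
ys = rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], W, H, b, W_y, b_y)
```

Each element of `ys` is a probability vector over the $K$ events computed from the observations seen so far.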

B. Long Short-Term Memory Cells 

LSTM is a network of neurons that implements a memory 
cell. The central idea behind LSTM is that the memory 
cell can maintain its state over time. When combined with 
RNN, LSTM units allow the recurrent network to remember 
long-term context dependencies. 

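LSTM consists of three gates (input gate $\mathbf{i}$, output gate 
$\mathbf{o}$, and forget gate $\mathbf{f}$) and a memory cell $\mathbf{c}$. See Figure 3 
for an illustration. At each time step $t$, LSTM first computes 
its gates’ activations $\{\mathbf{i}_t, \mathbf{f}_t\}$ (3)-(4) and updates its memory 
cell from $\mathbf{c}_{t-1}$ to $\mathbf{c}_t$ (5). It then computes the output gate 
activation $\mathbf{o}_t$ (6), and finally outputs a hidden representation 
$\mathbf{h}_t$ (7). The inputs into LSTM are the observations $\mathbf{x}_t$ and 
the hidden representation from the previous time step $\mathbf{h}_{t-1}$. 

LSTM applies the following set of update operations: 

$$\mathbf{i}_t = \sigma(\mathbf{W}_i\mathbf{x}_t + \mathbf{U}_i\mathbf{h}_{t-1} + \mathbf{V}_i\mathbf{c}_{t-1} + \mathbf{b}_i) \tag{3}$$

$$\mathbf{f}_t = \sigma(\mathbf{W}_f\mathbf{x}_t + \mathbf{U}_f\mathbf{h}_{t-1} + \mathbf{V}_f\mathbf{c}_{t-1} + \mathbf{b}_f) \tag{4}$$

$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tanh(\mathbf{W}_c\mathbf{x}_t + \mathbf{U}_c\mathbf{h}_{t-1} + \mathbf{b}_c) \tag{5}$$

$$\mathbf{o}_t = \sigma(\mathbf{W}_o\mathbf{x}_t + \mathbf{U}_o\mathbf{h}_{t-1} + \mathbf{V}_o\mathbf{c}_t + \mathbf{b}_o) \tag{6}$$

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \tag{7}$$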


where $\odot$ is an element-wise product and $\sigma$ is the logistic 
function. $\sigma$ and $\tanh$ are applied element-wise. $\mathbf{W}_*$, $\mathbf{V}_*$, 
$\mathbf{U}_*$, and $\mathbf{b}_*$ are the parameters. The input and forget gates 
of LSTM participate in updating the memory cell (5). More 
specifically, the forget gate controls the part of memory to forget, 
and the input gate computes new values based on the current 
observation that are written to the memory cell. The output 
gate together with the memory cell computes the hidden 
representation (7). Since LSTM cell activation involves summation 
over time (5) and derivatives distribute over sums, the 
gradient in LSTM gets propagated over a longer time before 
vanishing. In the standard RNN, we replace the non-linear $f$ 
in equation (1) by the LSTM equations given above in order 
to capture long temporal dependencies. We use the following 
shorthand notation to denote the recurrent LSTM operation. 

$$(\mathbf{h}_t, \mathbf{c}_t) = \mathrm{LSTM}(\mathbf{x}_t, \mathbf{h}_{t-1}, \mathbf{c}_{t-1}) \tag{8}$$

Fig. 3: Internal working of an LSTM unit. 

We now describe our RNN architecture with LSTM units 
for anticipation. We then describe a particular 
instantiation of our architecture for maneuver anticipation 
where the observations $\mathbf{x}$ come from multiple sources. 
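
The LSTM update in Eqs. (3)-(7) can be sketched in a few lines of pure Python. This is an illustrative sketch only: the parameter dictionary layout, the helper names, and the all-zero toy parameters are assumptions for the example, and the matrices $\mathbf{V}_*$ are treated as full matrices because the text does not state their structure.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(a):
    """Logistic function sigma."""
    return 1.0 / (1.0 + math.exp(-a))

def pre_activation(p, gate, x, h_prev, c=None):
    """Compute W x_t + U h_{t-1} (+ V c) + b for the named gate."""
    out = [wx + uh + b for wx, uh, b in
           zip(matvec(p["W_" + gate], x),
               matvec(p["U_" + gate], h_prev),
               p["b_" + gate])]
    if c is not None:  # memory-cell term V c, used by the i, f and o gates
        out = [a + vc for a, vc in zip(out, matvec(p["V_" + gate], c))]
    return out

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM update, Eqs. (3)-(7); returns (h_t, c_t) as in Eq. (8)."""
    i = [sigmoid(a) for a in pre_activation(p, "i", x, h_prev, c_prev)]  # (3)
    f = [sigmoid(a) for a in pre_activation(p, "f", x, h_prev, c_prev)]  # (4)
    g = [math.tanh(a) for a in pre_activation(p, "c", x, h_prev)]
    c = [f_k * c_k + i_k * g_k
         for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]                 # (5)
    o = [sigmoid(a) for a in pre_activation(p, "o", x, h_prev, c)]       # (6)
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]                 # (7)
    return h, c

# Hypothetical toy setup: 2-d input, 1 hidden unit, all parameters zero.
n_in, n_hid = 2, 1
params = {}
for gate in "ifco":
    params["W_" + gate] = [[0.0] * n_in for _ in range(n_hid)]
    params["U_" + gate] = [[0.0] * n_hid for _ in range(n_hid)]
    params["V_" + gate] = [[0.0] * n_hid for _ in range(n_hid)]  # V_c is never read
    params["b_" + gate] = [0.0] * n_hid
h_t, c_t = lstm_step([1.0, -1.0], [0.0], [2.0], params)
```

With all parameters set to zero every gate evaluates to 0.5 and the candidate value to 0, so the cell simply halves its previous state; this makes the role of the forget gate in Eq. (5) easy to check by hand.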

IV. Network Architecture for Anticipation 

In order to anticipate, an algorithm must learn to predict 
the future given only a partial temporal context. This 
makes anticipation challenging and also differentiates it from 
activity recognition. Previous works treat anticipation as a 
recognition problem [27] and train discriminative 
classifiers (such as SVM or CRF) on the complete temporal 
context. However, at test time these classifiers only observe 
a partial temporal context and make predictions within a 
filtering framework. We model anticipation with a recurrent 
architecture which unfolds through time. This lets us train 
a single classifier that learns how to handle partial temporal 
context of varying lengths. 

Furthermore, anticipation in robotics applications is challenging 
because the contextual information can come from 
multiple sensors with different data modalities. Examples 
include autonomous vehicles that reason from multiple sensors 
or robots that jointly reason over perception and 
language instructions [26]. In such applications the way 
information from different sensors is fused is critical to the 
application’s final performance. For example, Jain et al. [19] 
showed that for maneuver anticipation, learning a simple 
transformation of the sensory streams works better than 
direct concatenation of those streams. We therefore build an 
end-to-end deep learning architecture which jointly learns to 
anticipate and fuse information from different sensors. 

A. RNN with LSTM units for anticipation 

At the time of training, we observe the complete temporal 
observation sequence and the event $\{(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T), \mathbf{y}\}$. 
Our goal is to train a network which predicts the future 
event given a partial temporal observation sequence 
$\{(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_t) \mid t < T\}$. We do so by training an RNN 
in a sequence-to-sequence prediction manner. Given training 
examples $\{(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T)_j, \mathbf{y}_j\}_{j=1}^{N}$ we train an RNN 
with LSTM units to map the sequence of observations 
$(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T)$ to the sequence of events $(\mathbf{y}_1, \ldots, \mathbf{y}_T)$ such 
that $\mathbf{y}_t = \mathbf{y},\ \forall t$, as shown in Fig. 1. Trained in this manner, 
our RNN will attempt to map all sequences of partial observations 
$(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_t)\ \forall t \le T$ to the future event $\mathbf{y}$. This 
Fig. 4: Sensory fusion RNN for anticipation. (Bottom) In the 
Fusion-RNN each sensory stream is passed through its own independent 
RNN. (Middle) High-level representations from the RNNs are 
then combined through a fusion layer. (Top) In order to prevent 
over-fitting early in time the loss exponentially increases with time. 


way our model explicitly learns to anticipate. We additionally 
use LSTM units, which prevent the gradients from vanishing 
and allow our model to capture long temporal dependencies 
in human activities.^1 


B. Fusion-RNN: Sensory fusion RNN for anticipation 

We now present an instantiation of our RNN architecture 
for fusing two sensory streams: $\{(\mathbf{x}_1, \ldots, \mathbf{x}_T), (\mathbf{z}_1, \ldots, \mathbf{z}_T)\}$. 
In Sections V and VI we use the fusion architecture for 
maneuver anticipation. 

An obvious way to allow sensory fusion in the RNN is by 
concatenating the streams, i.e. using $([\mathbf{x}_1; \mathbf{z}_1], \ldots, [\mathbf{x}_T; \mathbf{z}_T])$ 
as input to the RNN. However, we found that this 
sort of simple concatenation performs poorly. We instead 
learn a sensory fusion layer which combines 
the high-level representations of sensor data. Our proposed 
architecture first passes the two sensory streams 
$\{(\mathbf{x}_1, \ldots, \mathbf{x}_T), (\mathbf{z}_1, \ldots, \mathbf{z}_T)\}$ independently through separate 
RNNs (9) and (10). The high-level representations from both 
RNNs $\{(\mathbf{h}^x_1, \ldots, \mathbf{h}^x_T), (\mathbf{h}^z_1, \ldots, \mathbf{h}^z_T)\}$ are then concatenated at 
each time step $t$ and passed through a fully connected 
(fusion) layer which fuses the two representations (11), as 
shown in Figure 4. The output representation from the fusion 
layer is then passed to the softmax layer for anticipation (12). 
The following operations are performed from $t = 1$ to $T$. 

$$(\mathbf{h}^x_t, \mathbf{c}^x_t) = \mathrm{LSTM}_x(\mathbf{x}_t, \mathbf{h}^x_{t-1}, \mathbf{c}^x_{t-1}) \tag{9}$$

$$(\mathbf{h}^z_t, \mathbf{c}^z_t) = \mathrm{LSTM}_z(\mathbf{z}_t, \mathbf{h}^z_{t-1}, \mathbf{c}^z_{t-1}) \tag{10}$$

$$\text{Sensory fusion:}\quad \mathbf{e}_t = \tanh(\mathbf{W}_f[\mathbf{h}^x_t; \mathbf{h}^z_t] + \mathbf{b}_f) \tag{11}$$

$$\mathbf{y}_t = \mathrm{softmax}(\mathbf{W}_y\mathbf{e}_t + \mathbf{b}_y) \tag{12}$$

where $\mathbf{W}_*$ and $\mathbf{b}_*$ are model parameters, and $\mathrm{LSTM}_x$ 
and $\mathrm{LSTM}_z$ process the sensory streams $(\mathbf{x}_1, \ldots, \mathbf{x}_T)$ and 
$(\mathbf{z}_1, \ldots, \mathbf{z}_T)$ respectively. The same framework can be extended 
to handle more sensory streams. 
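
To make the fusion step concrete, the sketch below implements only Eqs. (11) and (12) in pure Python, taking the per-stream hidden representations $\mathbf{h}^x_t$ and $\mathbf{h}^z_t$ as given (in the full model they would come from Eqs. (9) and (10)). The function name `fuse_and_predict` and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(a):
    """Numerically stable softmax over a list of scores."""
    top = max(a)
    exps = [math.exp(a_k - top) for a_k in a]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_predict(h_x, h_z, W_f, b_f, W_y, b_y):
    """Fuse two hidden representations and predict event probabilities."""
    concat = list(h_x) + list(h_z)                         # [h^x_t; h^z_t]
    e = [math.tanh(s + b)
         for s, b in zip(matvec(W_f, concat), b_f)]        # Eq. (11)
    scores = [s + b for s, b in zip(matvec(W_y, e), b_y)]
    return softmax(scores)                                 # Eq. (12)

# Hypothetical toy sizes: 2-d hidden state per stream, 3-d fusion layer, K = 2.
h_x = [0.2, -0.4]
h_z = [0.7, 0.1]
W_f = [[0.5, 0.0, -0.5, 0.0],
       [0.0, 1.0, 0.0, 1.0],
       [0.3, 0.3, 0.3, 0.3]]
b_f = [0.0, 0.1, -0.1]
W_y = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
b_y = [0.0, 0.0]
y_t = fuse_and_predict(h_x, h_z, W_f, b_f, W_y, b_y)
```

Adding a third sensory stream in this sketch would amount to extending the concatenation and widening `W_f` accordingly.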

C. Exponential loss-layer for anticipation. 

We propose a new loss layer which encourages the architecture 
to anticipate early while also ensuring that the 

^1 Driving maneuvers can take up to 6 seconds and the value of $T$ can go 
up to 150 with a camera frame rate of 25 fps. 


architecture does not over-fit the training data early in 
time when there is not enough context for anticipation. When 
using the standard softmax loss, the architecture suffers a 
loss of $-\log(y_t^k)$ for the mistakes it makes at each time 
step, where $y_t^k$ is the probability of the ground truth event $k$ 
computed by the architecture using Eq. (12). We propose to 
modify this loss by multiplying it with an exponential term 
as illustrated in Figure 4. Under this new scheme, the loss 
exponentially grows with time as shown below. 

$$\text{loss} = \sum_{j=1}^{N} \sum_{t=1}^{T} -e^{-(T-t)} \log(y_t^k) \tag{13}$$

This loss penalizes the RNN exponentially more for the mistakes 
it makes as it sees more observations. This encourages 
the model to fix mistakes as early as it can in time. The loss in 
equation (13) also penalizes the network less on mistakes made 
early in time when there is not enough context available. This 
way it acts like a regularizer and reduces the risk of over-fitting 
very early in time. 
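
The effect of the exponential weighting in Eq. (13) can be checked numerically with the short sketch below, which computes the loss of a single sequence from the ground-truth probabilities $y_t^k$. The function name and the example probability sequences are assumptions made for illustration; summing the result over the $N$ training sequences gives the full loss.

```python
import math

def anticipation_loss(probs_true, exponential=True):
    """Loss of one sequence; probs_true[t-1] is y_t^k for the true event k."""
    T = len(probs_true)
    total = 0.0
    for t, p in enumerate(probs_true, start=1):
        # Weight e^{-(T-t)} from Eq. (13); weight 1 gives the standard loss.
        weight = math.exp(-(T - t)) if exponential else 1.0
        total += -weight * math.log(p)
    return total

# Hypothetical sequences: the same single mistake made late versus early.
late_mistake = anticipation_loss([0.9, 0.9, 0.1])
early_mistake = anticipation_loss([0.1, 0.9, 0.9])
uniform_late = anticipation_loss([0.9, 0.9, 0.1], exponential=False)
uniform_early = anticipation_loss([0.1, 0.9, 0.9], exponential=False)
```

Under the exponential weighting the late mistake costs more than the early one, whereas the standard (uniform) loss assigns both sequences the same value.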


D. Model training and data augmentation 

Our architecture for maneuver anticipation has more than 
25,000 parameters that need to be learned (Section VI). 
With such a large number of parameters on a non-convex 
manifold, over-fitting becomes a major challenge. We therefore 
introduce redundancy in the training data which acts 
like a regularizer and reduces over-fitting. In order 
to augment training data, we extract sub-sequences 
of temporal observations. Given a training example with 
two temporal sensor streams $\{(\mathbf{x}_1, \ldots, \mathbf{x}_T), (\mathbf{z}_1, \ldots, \mathbf{z}_T), \mathbf{y}\}$, 
we uniformly randomly sample multiple sub-sequences 
$\{(\mathbf{x}_i, \ldots, \mathbf{x}_j), (\mathbf{z}_i, \ldots, \mathbf{z}_j), \mathbf{y} \mid 1 \le i < j \le T\}$ as additional 
training examples. It is important to note that data augmentation 
only adds redundancy and does not rely on any external 
source of new information. 
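
A minimal sketch of this sub-sequence sampling is given below. The function name `sample_subsequences`, the number of samples, and the toy streams are hypothetical; the paper does not specify how many sub-sequences are drawn per example.

```python
import random

def sample_subsequences(xs, zs, y, n_samples, rng):
    """Draw aligned sub-sequences (x_i..x_j), (z_i..z_j) with 1 <= i < j <= T."""
    T = len(xs)
    if len(zs) != T or T < 2:
        raise ValueError("streams must have equal length T >= 2")
    samples = []
    for _ in range(n_samples):
        # Two distinct indices in 1..T, sorted so that i < j.
        i, j = sorted(rng.sample(range(1, T + 1), 2))
        # The event label y is kept unchanged for every sub-sequence.
        samples.append((xs[i - 1:j], zs[i - 1:j], y))
    return samples

# Hypothetical toy example with T = 6 and 1-d features per time step.
xs = [[0.1 * t] for t in range(1, 7)]
zs = [[1.0 * t] for t in range(1, 7)]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
augmented = sample_subsequences(xs, zs, "left lane change", 5, rng)
```

Both streams are sliced with the same indices so that the inside and outside contexts stay temporally aligned.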


On the augmented data set, we train the network described 
in Section IV-B. We use RMSprop gradients, which have 
been shown to work well on training deep networks [7], 
and we keep the step size fixed at 10“^. We experimented 
with different variants of softmax loss, and our proposed 
loss layer with exponential growth, Eq. (13), works best for 
anticipation (see Section VI for details). 
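
For readers unfamiliar with RMSprop, the sketch below shows the commonly used form of its update rule on a one-parameter toy problem. The decay rate, the epsilon term, the step size of 0.01, and the quadratic objective are all assumptions chosen for illustration; they are not the settings used in the paper.

```python
import math

def rmsprop_step(params, grads, cache, step_size, decay=0.9, eps=1e-8):
    """One in-place RMSprop update with a fixed step size."""
    for k in range(len(params)):
        # Running average of squared gradients.
        cache[k] = decay * cache[k] + (1.0 - decay) * grads[k] ** 2
        # Scale the gradient by the root of that running average.
        params[k] -= step_size * grads[k] / (math.sqrt(cache[k]) + eps)
    return params, cache

# Hypothetical toy problem: minimize theta^2 starting from theta = 1.
theta = [1.0]
cache = [0.0]
for _ in range(200):
    grad = [2.0 * theta[0]]  # derivative of theta^2
    theta, cache = rmsprop_step(theta, grad, cache, step_size=0.01)
```

Because each gradient is divided by a running estimate of its magnitude, the effective step length stays roughly comparable across parameters with very different gradient scales.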


V. Context for maneuver anticipation 
In maneuver anticipation the goal is to anticipate the 
driver’s future maneuver several seconds before it happens [19]. 
The contextual information for anticipation 
is extracted from sensors installed in the vehicle. Previous 
work from Jain et al. [19] considers the context from a driver-facing 
camera, a camera facing the road in front, a Global 
Positioning System (GPS), and equipment for recording 
the vehicle’s dynamics. The overall contextual information 
from the sensors is grouped into: (i) the context from inside 
the vehicle, which comes from the driver-facing camera and 
is represented as a temporal sequence of features $(\mathbf{x}_1, \ldots, \mathbf{x}_T)$; 
and (ii) the context from outside the vehicle, which comes 
from the remaining sensors and is represented as $(\mathbf{z}_1, \ldots, \mathbf{z}_T)$. 

We improve the pipeline from Jain et al. [19] with our 
deep learning architecture and new features for maneuver 
Fig. 5: Maneuver anticipation pipeline. Temporal context in maneuver anticipation comes from cameras facing the driver and the road, 
the GPS, and the vehicle’s dynamics. (a, b and c) We improve upon the vision pipeline from Jain et al. [19] by tracking 68 landmark 
points on the driver’s face and including the 3D head-pose features. (d) Using the Fusion-RNN we combine the sensory streams of features 
from inside the vehicle (driver-facing camera) with the features from outside the vehicle (road camera, GPS, vehicle dynamics). (e) The 
model outputs the predicted probabilities of future maneuvers. 



Fig. 6: Improved features for maneuver anticipation. (a) Features using KLT tracker 2D trajectories. 
(b) Features using CLNF tracker 2D trajectories. We track 
facial landmark points using the CLNF tracker [3], which results in 
more consistent 2D trajectories as compared to the KLT tracker [32] 
used by Jain et al. [19]. Furthermore, the CLNF also gives an 
estimate of the driver’s 3D head pose. 

anticipation. Figure 5 shows our complete pipeline. In order 
to anticipate maneuvers, our RNN architecture (Figure 4) 
processes the temporal context $\{(\mathbf{x}_1, \ldots, \mathbf{x}_t), (\mathbf{z}_1, \ldots, \mathbf{z}_t)\}$ at 
every time step $t$, and outputs softmax probabilities $\mathbf{y}_t$ for 
the following five maneuvers: $\mathcal{M}$ = {left turn, right turn, left 
lane change, right lane change, straight driving}. We now 
give an overview of the feature representation used by Jain 
et al. [19] and then describe our features which significantly 
improve the performance. 

A. Features for maneuver anticipation 

In the vision pipeline of Jain et al. [19], the driver-facing 
camera detects discriminative points on the driver’s 
face and tracks the detected points across frames using the 
KLT tracker [32]. The tracking generates 2D optical flow 
trajectories in the image plane. From these trajectories the 
horizontal and angular movements of the face are extracted, 
and these movements are binned into histogram features for 
every frame. These histogram features are aggregated over 20 
frames (i.e. 0.8 seconds of driving) and constitute the feature 
vector $\mathbf{x}_t$. Further context for anticipation comes 
from the camera facing the road, the GPS, and the vehicle’s 
dynamics. This is denoted by the feature vector $\mathbf{z}_t$, 
and includes the lane information, the road artifacts such as 
intersections, and the vehicle’s speed. We refer the reader to 
Jain et al. [19] for more information on their features. 


B. 3D head pose and facial landmark features 

We now propose new features for maneuver anticipation 
which significantly improve upon the features from Jain et 
al. [19]. Instead of tracking discriminative points on the 
driver’s face we use the Constrained Local Neural Field 
(CLNF) model [3] and track 68 fixed landmark points on 
the driver’s face. CLNF is particularly well suited for driving 
scenarios due to its ability to handle a wide range of head pose 
and illumination variations. As shown in Figure 6, CLNF 
offers us two distinct benefits over the features from Jain 
et al. [19]: (i) while discriminative facial points may change 
from situation to situation, tracking fixed landmarks results in 
consistent optical flow trajectories which adds to robustness; 
and (ii) CLNF also allows us to estimate the 3D head pose 
of the driver’s face by minimizing error in the projection of a 
generic 3D mesh model of the face w.r.t. the 2D location of 
landmarks in the image. The histogram features generated 
from the optical flow trajectories along with the 3D head 
pose features (yaw, pitch and roll) give us $\mathbf{x}_t$. 

In Section VI we present results with the features from 
Jain et al. [19], as well as the results with our improved 
features obtained from the CLNF model. 

VI. Experiments 

We evaluate our proposed architecture on the task of 
maneuver anticipation [27], [36]. This is an important 
problem because in the US alone 33,000 people die 
in road accidents every year, a majority of which are 
due to dangerous maneuvers. Advanced Driver Assistance 
Systems (ADAS) have made driving safer by alerting drivers 
whenever they commit a dangerous maneuver. Unfortunately, 
many accidents are unavoidable because by the time drivers 
are alerted it is already too late. Maneuver anticipation can 
avert many accidents by alerting drivers before they perform 
a dangerous maneuver [30]. 

We evaluate on the driving data set publicly released by 
Jain et al. [19]. The data set consists of 1180 miles of 
natural freeway and city driving collected from 10 drivers 
over a period of two months. It contains videos with both 
inside and outside views of the vehicle, the vehicle’s speed, 
and GPS coordinates. The data set is annotated with 700 
events consisting of 274 lane changes, 131 turns, and 295 
randomly sampled instances of driving straight. Each lane 
change and turn is also annotated with the start time of 
the maneuver, i.e. right before the wheel touches the lane 
marking or the vehicle yaws at the intersection, respectively. 
We augment the data set using the technique described 
in Section IV-D and generate 2250 events from the 700 
original events. We train our deep learning architectures 
on the augmented data. We will make our code for data 
augmentation and maneuver anticipation publicly available 
at: http://www.brain4cars.com 

We compare our deep RNN architecture with the following 
baseline algorithms: 


1) Chance: Uniformly randomly anticipates a maneuver. 

2) Random-forest: A discriminative classifier that learns 
an ensemble of 150 decision trees. 

3) SVM [27]: Support Vector Machine classifier used by 
Morris et al. [27] for maneuver anticipation. 

4) IOHMM [19]: Input-Output Hidden Markov Model 
used by Jain et al. [19] for maneuver anticipation. 

5) AIO-HMM [19]: This model extends IOHMM by including 
autoregressive connections in the output layer. 
AIO-HMM achieved state-of-the-art performance in 
Jain et al. [19]. 


In order to study the effect of our design choices we also 
compare the following modifications of our architecture: 

6) Simple-RNN (S-RNN): In this architecture sensor 
streams are fused by simple concatenation and then 
passed through a single RNN with LSTM units. 

7) Fusion-RNN-Uniform-Loss (F-RNN-UL): In this architecture 
sensor streams are passed through separate 
RNNs, and the high-level representations from the RNNs 
are then fused via a fully connected layer. The loss at 
each time step takes the form $-\log(y_t^k)$. 

8) Fusion-RNN-Exp-Loss (F-RNN-EL): This architecture 
is similar to F-RNN-UL, except that the loss exponentially 
grows with time: $-e^{-(T-t)} \log(y_t^k)$. 

We use the RNN and LSTM implementations provided by 
Jain. In our RNNs we use a single-layer LSTM of size 64 
with sigmoid gate activations and tanh activation for the hidden 
representation. Our fully connected fusion layer uses tanh 
activation and outputs a 64-dimensional vector. Our overall 
architectures (F-RNN-EL and F-RNN-UL) have nearly 25,000 
parameters that are learned using RMSprop [7]. 

A. Evaluation setup 

We follow an evaluation setup similar to Jain et al. [19]. 
Algorithm 1 shows the inference steps for maneuver anticipation. 
At each time step $t$, features $\mathbf{x}_t$ and $\mathbf{z}_t$ are computed 
over the last 0.8 seconds of driving (20 frames). Using the 
temporal context $\{(\mathbf{x}_1, \ldots, \mathbf{x}_t), (\mathbf{z}_1, \ldots, \mathbf{z}_t)\}$, each anticipation 
algorithm computes the probability for maneuvers in 
$\mathcal{M}$ = {left lane change, right lane change, left turn, right 
turn, driving straight}. The prediction threshold is denoted 
by $p_{th} \in (0, 1]$ in Algorithm 1. The algorithm predicts 
driving straight if none of the softmax probabilities for the 
other maneuvers exceeds $p_{th}$. 
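
The inference procedure of Algorithm 1 can be sketched in Python as follows. The sketch assumes the per-step probabilities $\mathbf{y}_t$ have already been computed and are supplied as dictionaries; the function name `anticipate`, the use of `None` when no maneuver is predicted, and the example stream are assumptions for illustration. The time-to-maneuver is expressed here in time steps, $T - t$.

```python
def anticipate(prob_stream, threshold, default="driving straight"):
    """Return (predicted maneuver, time-to-maneuver in steps) for one sequence.

    prob_stream[t-1] maps each maneuver to its probability at time step t.
    """
    T = len(prob_stream)
    for t, y_t in enumerate(prob_stream, start=1):
        m_t = max(y_t, key=y_t.get)  # most probable maneuver at step t
        if m_t != default and y_t[m_t] > threshold:
            return m_t, T - t        # stop at the first confident prediction
    return default, None             # assumption: no time-to-maneuver defined

# Hypothetical probabilities over four time steps (lane-change setting).
stream = [
    {"left lane change": 0.2, "right lane change": 0.1, "driving straight": 0.7},
    {"left lane change": 0.5, "right lane change": 0.1, "driving straight": 0.4},
    {"left lane change": 0.8, "right lane change": 0.1, "driving straight": 0.1},
    {"left lane change": 0.9, "right lane change": 0.05, "driving straight": 0.05},
]
prediction = anticipate(stream, threshold=0.7)
no_prediction = anticipate(stream, threshold=0.95)
```

Raising the threshold makes the procedure more conservative: it predicts later, or falls back to driving straight when no probability ever exceeds the threshold.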

In order to evaluate an anticipation algorithm, we compute 
the following quantities for each maneuver $m \in \mathcal{M}$: (i) 

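Algorithm 1 Maneuver anticipation 
Initialize $m^* =$ driving straight 
Input: Features $\{(\mathbf{x}_1, \ldots, \mathbf{x}_T), (\mathbf{z}_1, \ldots, \mathbf{z}_T)\}$ and prediction threshold $p_{th}$ 
Output: Predicted maneuver $m^*$ 
while $t = 1$ to $T$ do 
    Observe features $(\mathbf{x}_1, \ldots, \mathbf{x}_t)$ and $(\mathbf{z}_1, \ldots, \mathbf{z}_t)$ 
    Estimate probability $\mathbf{y}_t$ of each maneuver in $\mathcal{M}$ 
    $m_t^* = \arg\max_{m \in \mathcal{M}} \mathbf{y}_t$ 
    if $m_t^* \neq$ driving straight & $\mathbf{y}_t\{m_t^*\} > p_{th}$ then 
        $m^* = m_t^*$ 
        $t_{before} = T - t$ 
        break 
    end if 
end while 
Return $m^*$, $t_{before}$ 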


$N_m$: the total number of instances of maneuver $m$; (ii) 
$TP_m$: the number of instances of maneuver $m$ correctly 
predicted by the algorithm; and (iii) $P_m$: the number of 
times the algorithm predicts $m$. Based on these quantities we 
evaluate the precision and recall of an anticipation algorithm 
as defined below: 


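$$Pr = \frac{1}{|\mathcal{M}| - 1} \sum_{m \in \mathcal{M} \setminus \{\text{driving straight}\}} \frac{TP_m}{P_m} \tag{14}$$

$$Re = \frac{1}{|\mathcal{M}| - 1} \sum_{m \in \mathcal{M} \setminus \{\text{driving straight}\}} \frac{TP_m}{N_m} \tag{15}$$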

We should note that the driving straight maneuver is not 
included in evaluating precision and recall (14)-(15). This 
is because anticipation algorithms by default predict driving 
straight when they are not confident about other maneuvers. 
For each anticipation algorithm, we choose a prediction 
threshold $p_{th}$ that maximizes its F1 score: $F1 = 2 \cdot Pr \cdot 
Re / (Pr + Re)$. In addition to precision and recall, we also 
measure the interval between the time an algorithm makes a 
prediction and the start of the maneuver. We refer to this as the 
time-to-maneuver and denote it with $t_{before}$ in Algorithm 1. 
We uniformly randomly partition the data set into five folds 
and report results using 5-fold cross-validation. We train on 
four folds and test on the fifth fold, and report the metrics 
averaged over the five folds. 
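
The precision, recall and F1 definitions above can be computed as in the following sketch. The function names and the toy labels are hypothetical, and treating a maneuver that is never predicted (or never occurs) as contributing 0 to the average is an assumption, since the text does not discuss that case.

```python
def precision_recall(true_labels, predicted_labels, maneuvers,
                     default="driving straight"):
    """Average precision and recall over all maneuvers except the default."""
    evaluated = [m for m in maneuvers if m != default]
    pr_sum = re_sum = 0.0
    for m in evaluated:
        n_m = sum(1 for t in true_labels if t == m)        # N_m
        p_m = sum(1 for p in predicted_labels if p == m)   # P_m
        tp_m = sum(1 for t, p in zip(true_labels, predicted_labels)
                   if t == p == m)                         # TP_m
        pr_sum += tp_m / p_m if p_m else 0.0  # assumption: 0 if never predicted
        re_sum += tp_m / n_m if n_m else 0.0  # assumption: 0 if never observed
    return pr_sum / len(evaluated), re_sum / len(evaluated)

def f1_score(pr, re):
    """F1 = 2 * Pr * Re / (Pr + Re), defined as 0 when both are 0."""
    return 2.0 * pr * re / (pr + re) if pr + re else 0.0

# Hypothetical toy evaluation in the lane-change setting.
maneuvers = ["left lane change", "right lane change", "driving straight"]
truth = ["left lane change", "left lane change",
         "right lane change", "driving straight"]
preds = ["left lane change", "driving straight",
         "right lane change", "right lane change"]
pr, re = precision_recall(truth, preds, maneuvers)
f1 = f1_score(pr, re)
```

Selecting the prediction threshold then amounts to repeating this computation for a range of candidate $p_{th}$ values and keeping the one with the highest F1 score.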


B. Results 

We evaluate anticipation algorithms on the maneuvers 
not seen during training with the following three prediction 
settings: (i) Lane change: algorithms only anticipate lane 
changes, i.e. $\mathcal{M}$ = {left lane change, right lane change, 
driving straight}. This setting is relevant for freeway driving; 
(ii) Turns: algorithms only anticipate turns, i.e. $\mathcal{M}$ = {left 
turn, right turn, driving straight}; and (iii) All maneuvers: 
algorithms anticipate all five maneuvers. Among these prediction 
settings, predicting all five maneuvers is the hardest. 

Table I compares the performance of the baseline anticipation 
algorithms and the variants of our deep learning 
model. All algorithms in Table I are evaluated on the 
features provided by Jain et al. [19], which ensures a fair 










TABLE I: Maneuver Anticipation Results. Average precision, recall and time-to-maneuver are computed from 5-fold cross-validation. 
Standard error is also shown. Algorithms are compared on the features from Jain et al. [19]. Rows marked (ours) are variants of our architecture. 

                 Lane change                           Turns                                 All maneuvers
Method           Pr (%)      Re (%)      Time-to-      Pr (%)      Re (%)      Time-to-      Pr (%)      Re (%)      Time-to-
                                         maneuver (s)                          maneuver (s)                          maneuver (s)
Chance           33.3        33.3        -             33.3        33.3        -             20.0        20.0        -
SVM [27]         73.7 ± 3.4  57.8 ± 2.8  2.40          64.7 ± 6.5  47.2 ± 7.6  2.40          43.7 ± 2.4  37.7 ± 1.8  1.20
Random-Forest    71.2 ± 2.4  53.4 ± 3.2  3.00          68.6 ± 3.5  44.4 ± 3.5  1.20          51.9 ± 1.6  27.7 ± 1.1  1.20
IOHMM [19]       81.6 ± 1.0  79.6 ± 1.9  3.98          77.6 ± 3.3  75.9 ± 2.5  4.42          74.2 ± 1.7  71.2 ± 1.6  3.83
AIO-HMM [19]     83.8 ± 1.3  79.2 ± 2.9  3.80          80.8 ± 3.4  75.2 ± 2.4  4.16          77.4 ± 2.3  71.2 ± 1.3  3.53
S-RNN (ours)     85.4 ± 0.7  86.0 ± 1.4  3.53          75.2 ± 1.4  75.3 ± 2.1  3.68          78.0 ± 1.5  71.1 ± 1.0  3.15
F-RNN-UL (ours)  92.7 ± 2.1  84.4 ± 2.8  3.46          81.2 ± 3.5  78.6 ± 2.8  3.94          82.2 ± 1.0  75.9 ± 1.5  3.75
F-RNN-EL (ours)  88.2 ± 1.4  86.0 ± 0.7  3.42          83.8 ± 2.1  79.9 ± 3.5  3.78          84.5 ± 1.0  77.1 ± 1.3  3.58


comparison. We observe that variants of our architecture 
outperform the previous state-of-the-art in most cases. 
This improvement in performance is because RNNs 
with LSTM units are very expressive models and, unlike 
Jain et al. [19], they do not make any assumption about the 
generative nature of the problem. 


The performance of several variants of our architecture, 
reported in Table I, justifies the design decisions that led to 
the final architecture, as discussed here. S-RNN performs a 
very simple fusion by concatenating the two sensor streams. 
On the other hand, F-RNN models each sensor stream with 
a separate RNN and then uses a fully connected layer at 
each time step to fuse the high-level representations. This 
form of sensory fusion is more principled since the sensor 
streams represent different data modalities. Fusing high-level 
representations instead of concatenating raw features gives a 
significant improvement in performance, as shown in Table I. 
When predicting all maneuvers, F-RNN-EL has about 6% higher 
precision and recall than S-RNN (84.5% vs. 78.0% and 77.1% vs. 71.1%). 
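
The F-RNN style of fusion described above can be illustrated with a forward pass in plain Python. This is only a sketch under loud assumptions: a simple tanh recurrent unit stands in for the LSTM layers, the weights are random and untrained, and all class names, layer sizes, and the stream names `seq_a` and `seq_b` are hypothetical.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class RecurrentUnit:
    """Plain tanh recurrent unit; an assumed stand-in for an LSTM layer."""

    def __init__(self, n_in, n_hidden, rng):
        self.w_in = rand_matrix(n_hidden, n_in, rng)
        self.w_rec = rand_matrix(n_hidden, n_hidden, rng)
        self.h = [0.0] * n_hidden

    def reset(self):
        self.h = [0.0] * len(self.h)

    def step(self, x):
        a = matvec(self.w_in, x)
        b = matvec(self.w_rec, self.h)
        self.h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        return self.h


class FusionRNN:
    """One recurrent unit per sensor stream, fused at every time step."""

    def __init__(self, n_a, n_b, n_hidden, n_fused, n_classes, seed=0):
        rng = random.Random(seed)
        self.rnn_a = RecurrentUnit(n_a, n_hidden, rng)
        self.rnn_b = RecurrentUnit(n_b, n_hidden, rng)
        self.w_fuse = rand_matrix(n_fused, 2 * n_hidden, rng)  # fully connected
        self.w_out = rand_matrix(n_classes, n_fused, rng)

    def forward(self, seq_a, seq_b):
        """Return one class distribution per time step."""
        self.rnn_a.reset()
        self.rnn_b.reset()
        outputs = []
        for x_a, x_b in zip(seq_a, seq_b):
            # Concatenate the high-level representations, not the raw inputs.
            h = self.rnn_a.step(x_a) + self.rnn_b.step(x_b)
            fused = [math.tanh(v) for v in matvec(self.w_fuse, h)]
            outputs.append(softmax(matvec(self.w_out, fused)))
        return outputs


model = FusionRNN(n_a=3, n_b=2, n_hidden=4, n_fused=4, n_classes=5)
seq_a = [[0.1, 0.2, 0.3]] * 4   # four time steps of the first stream
seq_b = [[0.5, -0.5]] * 4       # four time steps of the second stream
for t, dist in enumerate(model.forward(seq_a, seq_b), start=1):
    print(t, [round(p, 3) for p in dist])
```

An S-RNN-style model would, by contrast, feed the concatenated raw features `x_a + x_b` into a single recurrent unit, so the two modalities would share one hidden state from the first layer on.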

As shown in Table I, exponentially growing the loss 
improves performance. Our new loss scheme penalizes the 
network in proportion to the length of context it has seen. 
When predicting all maneuvers, we observe that F-RNN-EL 
shows an improvement of about 2% in precision and 1% in recall over 
F-RNN-UL (84.5% vs. 82.2% and 77.1% vs. 75.9%). We conjecture that exponentially growing the 
loss acts like a regularizer. It reduces the risk of our network 
over-fitting early in time when there is not enough context 
available. Furthermore, the time-to-maneuver remains 
comparable for F-RNN with and without exponential loss. 
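
The effect of growing the loss over time can be seen in a small numerical sketch. The weighting used here, exp(-(T - t)) on the negative log-likelihood at time step t of a length-T sequence, is an assumption made for illustration; the function name `anticipation_loss` and the example probabilities are hypothetical.

```python
import math


def anticipation_loss(probs_true, exponential=True):
    """Sum of per-time-step negative log-likelihoods of the true maneuver.

    probs_true[t - 1] is the probability assigned to the true maneuver
    after the network has seen the first t of the T time steps.
    """
    T = len(probs_true)
    total = 0.0
    for t, p in enumerate(probs_true, start=1):
        # Assumed weighting: exp(-(T - t)) grows to 1 as t approaches T.
        w = math.exp(-(T - t)) if exponential else 1.0
        total += -w * math.log(p)
    return total


early_mistake = [0.1, 0.9, 0.9]  # wrong only when little context is seen
late_mistake = [0.9, 0.9, 0.1]   # wrong after seeing the full context

for name, probs in [("early", early_mistake), ("late", late_mistake)]:
    print(name,
          round(anticipation_loss(probs, exponential=False), 3),
          round(anticipation_loss(probs, exponential=True), 3))
```

With uniform weights the two sequences incur the same loss, whereas with the growing weights the late mistake is penalized far more heavily than the early one, which is the behavior the paragraph above describes.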


We study the effect of our improved features in Table II. 
We replace the pipeline for extracting features from the 
driver’s face [19] by a Constrained Local Neural Field 
(CLNF) model. Our new vision pipeline tracks 68 facial 
landmark points and estimates the driver’s 3D head pose as 
described in Section V-A. We see a significant increase of 6% 
in precision and 10% in recall for F-RNN-EL when 
using features from our new vision pipeline. We attribute this 
increase in performance to the following reasons: (i) the 
robustness of the CLNF model to variations in illumination and 
head pose; (ii) 3D head-pose features are very informative 
for understanding the driver’s intention; and (iii) optical 
flow trajectories generated by tracking facial landmark points 
represent head movements better, as illustrated in the accompanying figure. 

The confusion matrix in Figure 7 shows the precision 
for each maneuver. F-RNN-EL gives a higher precision 
than AIO-HMM on every maneuver when both algorithms 
are trained on the same features (Fig. 7). Our new vision 
pipeline further improves the precision of F-RNN-EL on all 


TABLE II: 3D head-pose features. In this table we study the effect 
of better features with the best performing algorithm from Table I in 
the 'All maneuvers' setting. We use the CLNF model to track 68 facial landmark 
points and estimate 3D head-pose. 

Method                      Pr (%)      Re (%)      Time-to-maneuver (s)
F-RNN-EL                    84.5 ± 1.0  77.1 ± 1.3  3.58
F-RNN-EL w/ 3D head-pose    90.5 ± 1.0  87.4 ± 0.5  3.16



Fig. 8: Effect of prediction threshold $p_{th}$. At test time an 
algorithm makes a prediction only when it is at least $p_{th}$ confident 
in its prediction. This plot shows how the F1-score varies with the 
prediction threshold. 

maneuvers (Fig. 7). Additionally, both F-RNN and AIO-HMM 
perform significantly better than previous work on 
maneuver anticipation by Morris et al. [27] (Fig. 7). 

In Figure 8 we study how the F1-score varies as we change 
the prediction threshold $p_{th}$. We make the following 
observations: (i) the F1-score does not undergo large variations 
with changes to the prediction threshold. Hence, practitioners 
can trade off precision against 
recall without hurting the F1-score by much; and (ii) the 
maximum F1-score attained by F-RNN-EL is 4% more than 
AIO-HMM when compared on the same features and 13% 
more with our new vision pipeline. In Tables I and II we 
used the threshold values which gave the highest F1-score. 

VII. Conclusion 

In this work we addressed the problem of anticipating 
maneuvers several seconds before they happen. This problem 
requires the modeling of long temporal dependencies and 
the fusion of multiple sensory streams. We proposed a 
novel deep learning architecture based on Recurrent Neural 
Networks (RNNs) with Long Short-Term Memory (LSTM) 
units for anticipation. Our architecture learns to fuse multiple 
sensory streams, and by training it in a sequence-to-sequence 
prediction manner, it explicitly learns to anticipate using only 
a partial temporal context. We also proposed a novel loss 
layer for anticipation which prevents over-fitting. 

Our deep learning architecture outperformed the previous 
state-of-the-art on a natural driving data set of 1180 miles. 

Fig. 7: Confusion matrix of different algorithms when jointly predicting all the maneuvers. Predictions made by algorithms are represented 
by rows and actual maneuvers are represented by columns. Numbers on the diagonal represent precision. 


It improved the precision from 77.4% to 84.5% and recall 
from 71.2% to 77.1%. We further showed that improving 
head tracking and including the driver’s 3D head pose 
as a feature gives a significant boost in performance, 
increasing the precision to 90.5% and recall to 87.4%. We 
believe that our approach is widely applicable to many 
activity anticipation problems. As more anticipation data 
sets become publicly available, we expect to see a similar 
improvement in performance with our architecture. 